Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (3-seed mean no-TTT: 1.1225)#1170
Conversation
- NativeFlowMatcher: 393K-param OT-CFM velocity network with gated hidden-state correction
- Legal score-first TTT: SGD lr=0.002, 10 epochs, freeze_blocks=2
- val_bpb: 1.11991 (sliding window stride=64, legal TTT)
- val_bpb: 1.12312 (sliding window stride=64, no TTT)
- Artifact: 15,745,776 bytes (254K headroom)
- Single-seed (42) exploratory submission
- Supplementary: eval logs, SLURM scripts, comparison data
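For readers unfamiliar with the stride-64 sliding-window protocol behind these val_bpb numbers, here is a minimal sketch (hypothetical helper name; assumes a model that maps token ids to next-token logits). The repo's actual eval script may differ in details, and the returned value is bits per scored token, which equals bits-per-byte only under a 1-token-per-byte assumption:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, window=1024, stride=64):
    """Sliding-window eval (sketch): slide a fixed context window over the
    sequence and score only the tokens not covered by the previous window,
    so each scored position sees up to ``window - stride`` tokens of context."""
    model.eval()
    n = len(tokens)
    total_nll, total_scored = 0.0, 0
    prev_end = 1                                   # token 0 is never a target
    for begin in range(0, n, stride):
        end = min(begin + window, n)
        ids = torch.tensor(tokens[begin:end]).unsqueeze(0)   # (1, T)
        logits = model(ids[:, :-1])                # assumed shape: (1, T-1, V)
        targets = ids[:, 1:].clone()
        n_new = end - prev_end                     # targets not yet scored
        targets[:, : targets.shape[1] - n_new] = -100        # mask overlap
        total_nll += F.cross_entropy(
            logits.transpose(1, 2), targets,
            ignore_index=-100, reduction="sum",
        ).item()
        total_scored += n_new
        prev_end = end
        if end == n:
            break
    # nats per scored token -> bits per scored token
    return total_nll / total_scored / math.log(2)
```

With a subword vocabulary, the final division would instead be by the byte count of the evaluated text.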
- 2×2 matrix: NFM × TTT with base no-TTT baseline (1.12087)
- Loss weight sweep: 0.01, 0.05, 0.1, 0.2
- Hidden dim sweep: 128, 256, 512
- 13 SLURM jobs submitted (6 train + 7 eval)
- Results pending, will update when jobs complete
Ablation Studies Submitted

13 SLURM jobs have been submitted to run comprehensive ablation studies for this NFM submission.

2×2 Matrix: NFM × Legal TTT

Isolating the individual contributions of NFM and legal TTT at matched 7k steps.
NFM Hyperparameter Sweeps

Loss weight sweep (hidden_dim=256, seed=42):
Hidden dim sweep (loss_weight=0.1, seed=42):
Also pending
Results will be updated in the README as jobs complete.
- Training completed for seeds 42, 1337, 2025 (all 7k steps)
- 3-seed mean sliding BPB (no TTT): 1.12252 ± 0.00151
- Seed 42: 1.12312, Seed 1337: 1.12367, Seed 2025: 1.12077
- Legal TTT eval jobs submitted (SLURM 55411651-55411654)
- Added completed E2E TTT+Flow eval log (SLURM 55398555, BPB=1.12418)
- Added training logs and SLURM scripts for all seed runs
- Updated README with 3-seed results table and training trajectories
- Updated submission.json with per-seed metrics and job IDs
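As a sanity check, the 3-seed summary can be recomputed from the per-seed values above. The mean matches exactly; from the rounded per-seed BPBs the sample standard deviation comes out ≈0.00154, so the reported ±0.00151 presumably used unrounded values:

```python
import statistics

# Per-seed sliding-window BPB (no TTT), as reported above.
seed_bpb = {42: 1.12312, 1337: 1.12367, 2025: 1.12077}

mean = statistics.mean(seed_bpb.values())
stdev = statistics.stdev(seed_bpb.values())  # sample (n-1) standard deviation

print(f"{mean:.5f} ± {stdev:.5f}")  # 1.12252 ± 0.00154
```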
- Three-seed legal TTT: mean 1.11928 ± 0.00146 (seeds 42, 1337, 2025)
- 2×2 NFM×TTT matrix complete: NFM hurts by +0.002 (no-TTT) / +0.001 (TTT)
- Loss weight sweep: lw=0.05 best but still +0.002 worse than base
- Hidden dim sweep: hd=512 best but still +0.001 worse than base
- Updated limitations section to reflect negative result conclusion
Community Review — Non-record: 11L NativeFlowMatcher + Legal TTT — val_bpb 1.1199 (3-seed mean no-TTT: 1.1225)

BPB: 1.1199 | Compliance: FLAG — hashed n-gram cache with target-in-key (PR #779 family pattern)

What I found in the code (head SHA): the n-gram lookup key at line 1475 is constructed by XOR-ing the target token into the hash. This matches the #779 family pattern; per Issue #1017 condition 1, it triggers a compliance flag.

Cluster context: this same structural pattern has been closed on 15+ PRs under the #779 ruling as of 2026-04-11 (#779 itself, #770, #798, #808, #825, #786, #797, #909, #940, #761, #776, #788, #774, #778, #715, #758, #702 upstream, #1488). The base neural model is unaffected by this flag — in every case where the authors resubmitted without the n-gram cache, the base val_bpb has been in the ~1.10-1.15 range (standard for the SP1024 11L class).

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=115032 B, SMOKE_TEST_PASS.

Verdict: COMPLIANCE FLAG — target-in-key hashed n-gram cache, same family as PR #779.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as the rest of the family-bug cluster. A context-only resubmission (drop the target from the lookup key and use a full-vocabulary reweighting from a single context row, per @valerio-oai's suggested legal path on #779) would be welcomed.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based pattern matching.
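To make the flagged pattern concrete, here is an illustrative sketch (hypothetical names and constants, not the submission's actual code) of a target-in-key lookup next to the context-only reweighting the review recommends:

```python
def ngram_key_flagged(context_hash: int, target_token: int) -> int:
    # FLAGGED (#779 family): the target token is XOR-ed into the lookup key,
    # so a cache hit is only possible when the label being predicted is
    # already known -- the key itself leaks the answer.
    return (context_hash ^ ((target_token * 0x9E3779B1) & 0xFFFFFFFF)) & 0xFFFFFFFF


def ngram_reweight_legal(context_hash: int, table: dict, vocab_size: int) -> list:
    # LEGAL path (per the review): key on the context only and read out a
    # full per-vocabulary weight row; the target never enters the key.
    row = table.get(context_hash & 0xFFFFFFFF)
    return list(row) if row is not None else [1.0] * vocab_size  # uniform fallback
```

The structural tell is that the flagged key is a different value for every candidate target, while the legal lookup produces one row per context regardless of the target.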
Summary
Non-record submission exploring NativeFlowMatcher (NFM) — a 393K-parameter OT-CFM (Optimal Transport Conditional Flow Matching) velocity network that applies gated hidden-state correction to transformer hidden states, jointly trained with the AR objective. The Flow Matching module is trained as distribution transport, but used at inference as a small residual correction.
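As a rough illustration of the idea (a minimal sketch with hypothetical names and sizes, not the submission's 393K-parameter implementation), a small velocity network can be trained with the OT-CFM regression target and then applied at inference as a gated residual correction to a hidden state:

```python
import torch
import torch.nn as nn

class GatedFlowCorrector(nn.Module):
    """Sketch of an OT-CFM velocity network used as a gated residual corrector.

    Training: sample t ~ U[0, 1], interpolate x_t = (1 - t) * x0 + t * x1 along
    the linear OT path, and regress the velocity net onto the target (x1 - x0).
    Inference: add a small, gated step of the predicted velocity to h.
    """
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.velocity = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(), nn.Linear(hidden, dim),
        )
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def cfm_loss(self, x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        t = torch.rand(x0.shape[0], 1, device=x0.device)
        xt = (1 - t) * x0 + t * x1                       # linear OT path
        v = self.velocity(torch.cat([xt, t], dim=-1))
        return ((v - (x1 - x0)) ** 2).mean()             # regress onto x1 - x0

    def forward(self, h: torch.Tensor, step: float = 1.0) -> torch.Tensor:
        t = torch.ones(h.shape[0], 1, device=h.device)   # evaluate at path end
        v = self.velocity(torch.cat([h, t], dim=-1))
        return h + step * self.gate(h) * v               # gated residual correction
```

In a joint-training setup, `cfm_loss` would be added to the AR loss with a small weight (the sweeps below use 0.01-0.2), while `forward` supplies the inference-time correction.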
Results
Three-seed reproducibility (training-time sliding window, no TTT):
Primary (seed=42, with legal TTT):
Legal TTT gain: −0.00321 BPB
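The legal TTT recipe reported above (SGD lr=0.002, 10 epochs, freeze_blocks=2) can be sketched as follows. `model.blocks` and a loss-returning forward are assumptions, not the repo's actual interface, and the "score-first" ordering detail is omitted:

```python
import copy
import torch

def legal_ttt(model, eval_stream, lr=0.002, epochs=10, freeze_blocks=2):
    """Sketch of test-time training on the eval stream with the first
    `freeze_blocks` transformer blocks frozen. Assumes `model.blocks` is an
    indexable list of blocks and `model(batch)` returns a scalar LM loss."""
    model = copy.deepcopy(model)              # never mutate the base checkpoint
    for block in model.blocks[:freeze_blocks]:
        for p in block.parameters():
            p.requires_grad_(False)           # freeze_blocks=2: first two fixed
    opt = torch.optim.SGD(
        [p for p in model.parameters() if p.requires_grad], lr=lr
    )
    for _ in range(epochs):                   # 10 passes over the eval stream
        for batch in eval_stream:
            opt.zero_grad()
            model(batch).backward()
            opt.step()
    return model
```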
Architecture
Training
Ablation Studies
2×2 Matrix: NFM × TTT (isolating NFM contribution):
Base retraining is running. Loss weight sweep (lw=0.01, 0.05, 0.20) and hidden dim sweep (hd=128, 512) are queued.
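The ablation design above can be written down as a small config enumeration (a sketch with hypothetical field names; the values are the ones listed in this thread):

```python
from itertools import product

# 2x2 matrix: NFM on/off x legal TTT on/off, at matched 7k training steps.
matrix = [
    {"nfm": nfm, "ttt": ttt, "steps": 7000}
    for nfm, ttt in product([False, True], repeat=2)
]

# Loss weight sweep (hidden_dim=256, seed=42) and hidden dim sweep
# (loss_weight=0.1, seed=42), as described in the ablation plan.
lw_sweep = [{"nfm": True, "loss_weight": lw, "hidden_dim": 256, "seed": 42}
            for lw in (0.01, 0.05, 0.1, 0.2)]
hd_sweep = [{"nfm": True, "loss_weight": 0.1, "hidden_dim": hd, "seed": 42}
            for hd in (128, 256, 512)]
```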
Supplementary: E2E TTT + FlowRefiner 7k eval completed: legal TTT BPB = 1.12418.
Limitations
Credits
Base architecture (PR #549, @abaybektursun), Muon (baseline), BigramHash/SmearGate (PR #65, @aquariouserworkman), XSA (PR #187/#265, @Idan3011/@unnir), mixed quant (PR #76), sliding window (PR #50, @mattqlf), legal TTT (PR #77, @samacqua, PR #461, @Christopher-Lee-McClendon), VE/PartialRoPE/LN Scale (PR #315/#374, @jfprincz/@unnir), gated attention/value residual (PR #940), EMA (PR #65, @aquariouserworkman)
Checklist